Abstract

This report describes an analytical model for the access and cycle times of 
direct-mapped and set-associative caches. The inputs to the model are the cache
size, block size, and associativity, as well as array organization and process 
parameters. The model gives estimates that are within 10% of Hspice results for 
the circuits we have chosen. 

Software implementing the model is available from DEC WRL.

1. Introduction



Most computer architecture research involves investigating trade-offs between various
alternatives. This cannot adequately be done without a firm grasp of the costs of each alternative.
As an example, it is impossible to compare two different cache organizations without considering
the difference in access or cycle times. Similarly, the chip area and power requirements of
each alternative must be taken into account. Only when all the costs are considered can an
informed decision be made.

Unfortunately, it is often difficult to determine costs. One solution is to employ analytical 
models that predict costs based on various architectural parameters. In the cache domain, both 
access time models [8] and chip area models [5] have been published. In [8], Wada et al. present 
an equation for the access time of a cache as a function of various cache parameters (cache size, 
associativity, block size) as well as organizational and process parameters. In [5], Mulder et al. 
derive an equation for the chip area required by a cache using similar input parameters. 

This report describes an extension of Wada's model. Some of the new features are:

• an additional array organizational parameter 

• improved decoder and wordline models 

• pre-charged and column-multiplexed bitlines 

• a tag array model with comparator and multiplexor drivers 

• cycle time expressions 

The goal of this work was to derive relatively simple equations that predict the access/cycle
times of caches as a function of various cache parameters, process parameters, and array
organization parameters. The cache parameters as well as the array organization parameters will
be discussed in Section 4. The process parameters will be introduced as they are used; Appendix
II contains the values of the process parameters for a 0.8μm CMOS process [3].

Any model needs to be validated before the results generated using the model can be trusted.
In [8], an Hspice model of the cache was used to validate the analytical model. The same
approach was used here. Of course, this only shows that the model matches the Hspice model; it
does not address the issue of how well the assumed cache structure (and hence the Hspice
model) reflects a real cache design. When designing a real cache, many circuit tricks could be
employed to optimize certain stages in the critical path. Nevertheless, the relative access times
between different configurations should be more accurate than the absolute access times, and this
is often more important for optimization studies.

The model described in this report has been implemented, and the software is available from 
DEC WRL. Section 2 explains how to obtain and use the software. The remainder of the report 
explains how the model was derived. For the user who is only interested in using the model,
there is no need to read beyond Section 2.

2. Obtaining and Using the Software

A program that implements the model described in this report is available. To obtain the 
software, log into gatekeeper.dec.com using anonymous ftp. (Use "anonymous" as the login 
name and your machine name as the password.) The files for the program are stored together in 
"/archive/pub/DEC/cacti.tar.Z". Get this file, "uncompress" it, and extract the files using "tar". 

The program consists of a number of C files; time.c contains the model. Transistor widths and 
process parameters are defined in def.h. A makefile is provided to compile the program. 

Once the program is compiled, it can be run using the command: 
cacti C B A 

where C is the cache size (in bytes), B is the block size (in bytes), and A is the associativity. The 
output width and internal address width can be changed in def.h. 

When the program is run, it will consider all reasonable values for the array organization 
parameters (discussed in Section 4) and choose the organization that gives the smallest access 
time. The values of the array organization parameters chosen are included in the output report. 
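The search described above can be pictured with a minimal C sketch. It is not the code in time.c: the `organization` struct, the `best_organization` function, the power-of-two search space bounded by `max_n`, and the `toy_estimate` cost function are all assumptions made for illustration, and only the three data array parameters are searched.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical container for the three data array organization parameters. */
typedef struct {
    int n_dwl;  /* vertical cuts (more, shorter wordlines) */
    int n_dbl;  /* horizontal cuts (shorter bitlines) */
    int n_spd;  /* sets mapped to a single wordline */
} organization;

/* Caller-supplied access time estimate for one organization. */
typedef double (*access_time_fn)(organization org);

/* Toy stand-in for the real model: smallest at (2, 4, 1). Illustration only. */
double toy_estimate(organization org)
{
    return 1.0 + abs(org.n_dwl - 2) + abs(org.n_dbl - 4) + abs(org.n_spd - 1);
}

/*
 * Try every power-of-two combination up to max_n (an assumed search space)
 * and return the organization with the smallest estimated access time.
 */
organization best_organization(access_time_fn estimate, int max_n,
                               double *best_time)
{
    organization best = {1, 1, 1};
    double best_t = DBL_MAX;

    assert(estimate != NULL && max_n >= 1);
    for (int dwl = 1; dwl <= max_n; dwl *= 2) {
        for (int dbl = 1; dbl <= max_n; dbl *= 2) {
            for (int spd = 1; spd <= max_n; spd *= 2) {
                organization org = {dwl, dbl, spd};
                double t = estimate(org);
                if (t < best_t) {   /* keep the first minimum found */
                    best_t = t;
                    best = org;
                }
            }
        }
    }
    if (best_time != NULL)
        *best_time = best_t;
    return best;
}
```

Passing the estimate as a function pointer keeps the search independent of the delay equations, so the same loop could be reused for the tag array parameters.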

3. Cache Structure 

Before describing the model, the internal structure of an SRAM cache will be briefly 
reviewed. Figure 1 shows the assumed organization. The decoder first decodes the address and 
selects the appropriate row by driving one wordline in the data array and one wordline in the tag 
array. Each array contains as many wordlines as there are rows in the array, but only one 
wordline in each array can go high at a time. Each memory cell along the selected row is
associated with a pair of bitlines; each bitline is initially precharged high. When a wordline goes
high, each memory cell in that row pulls down one of its two bitlines; the value stored in the 
memory cell determines which bitline goes low. 

Each sense amplifier monitors a pair of bitlines and detects when one changes. By detecting 
which line goes low, the sense amplifier can determine what is in the memory cell. It is possible 
for one sense amplifier to be shared among several pairs of bitlines. In this case, a multiplexor is 
inserted before the sense amps; the select lines of the multiplexor are driven by the decoder. The 
number of bitlines that share a sense amplifier depends on the layout parameters described in the 
next section. Section 6.4 discusses this further. 

The information read from the tag array is compared to the tag bits of the address. In an 
A-way set-associative cache, A comparators are required. The results of the A comparisons are 
used to drive a valid (hit/miss) output as well as to drive the output multiplexors. These output 
multiplexors select the proper data from the data array (in a set-associative cache or a cache in
which the data array width is larger than the output width), and drive the selected data out of the 
cache. 

It is important to note that there are two potential critical paths in a cache read access. If the
time to read the tag array, perform the comparison, and drive the multiplexor select signals is
larger than the time to read the data array, then the tag side is the critical path, while if it takes
longer to read the data array, then the data side is the critical path. Since either side could
determine the access time, both must be modeled in detail.

Figure 1: Cache structure



4. Cache and Array Organization Parameters 

The following cache parameters are used as inputs to the model: 

• C: Cache size in bytes 

• B: Block size in bytes 

• A: Associativity 

• $b_o$: Output width in bits

• $b_{addr}$: Address width in bits

In addition, there are six array organization parameters. In the basic organization discussed by
Wada [8], a single set shares a common wordline. Figure 2-a shows this organization, where B
is the block size (in bytes), A is the associativity, and S is the number of sets
($S = \frac{C}{B \times A}$). Clearly, such an organization could result in an array that is much
larger in one direction than the other, causing either the bitlines or wordlines to be very slow.
This could result in a longer-than-necessary access time. To alleviate this problem, Wada describes
how the array can be broken horizontally and vertically and defines two parameters, $N_{dwl}$ and
$N_{dbl}$, which indicate to what extent the array has been divided. The parameter $N_{dwl}$
indicates how many times the array has been split with vertical cut lines (creating more, but
shorter, wordlines), while $N_{dbl}$ indicates how many times the array has been split with
horizontal cut lines (causing shorter bitlines). The total number of subarrays is
$N_{dwl} \times N_{dbl}$.

Figure 2-b introduces another organization parameter, $N_{spd}$. This parameter indicates how
many sets are mapped to a single wordline, and allows the overall access time of the array to be 
changed without breaking it into smaller subarrays. 



Figure 2: Cache organization parameter $N_{spd}$ (a: original array, width 8×B×A; b: Nspd = 2, width 16×B×A, S/2 rows)

The optimum values of $N_{dwl}$, $N_{dbl}$, and $N_{spd}$ depend on the cache and block sizes, as
well as the associativity.

Notice that increasing these parameters is not free in terms of area. Increasing $N_{dbl}$ or
$N_{spd}$ beyond one increases the number of sense amplifiers required, while increasing $N_{dwl}$
means more wordline drivers are required. Except in the case of a direct-mapped cache with the block
length equal to the processor word length and all three parameters equal to one, a multiplexor is
required to select the appropriate sense amplifier output to return to the processor. Increasing
$N_{dbl}$ or $N_{spd}$ increases the size of the multiplexor.



Using these organizational parameters, each subarray contains

$$\frac{8 \times B \times A \times N_{spd}}{N_{dwl}}$$

columns and

$$\frac{C}{B \times A \times N_{dbl} \times N_{spd}}$$

rows.

We assume that the tag array can be broken up independently of the data array. Thus, there
are also three tag array parameters: $N_{twl}$, $N_{tbl}$, and $N_{tspd}$.
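As a quick check of the data subarray dimensions, the following C sketch computes 8×B×A×Nspd/Ndwl columns and C/(B×A×Ndbl×Nspd) rows. The function names are hypothetical, and the sketch assumes the parameters divide evenly (as they do for power-of-two sizes).

```c
#include <assert.h>

/* Columns in each data subarray: 8*B*A*N_spd / N_dwl. */
long subarray_cols(long b, long a, long n_dwl, long n_spd)
{
    assert(b > 0 && a > 0 && n_dwl > 0 && n_spd > 0);
    return 8 * b * a * n_spd / n_dwl;
}

/* Rows in each data subarray: C / (B*A*N_dbl*N_spd). */
long subarray_rows(long c, long b, long a, long n_dbl, long n_spd)
{
    assert(c > 0 && b > 0 && a > 0 && n_dbl > 0 && n_spd > 0);
    return c / (b * a * n_dbl * n_spd);
}
```

For example, a 16 KB two-way cache with 16-byte blocks, Ndwl = 1, Ndbl = 2, and Nspd = 1 gives two subarrays of 256 rows by 256 columns, which together hold the 131072 data bits of the cache.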

5. Methodology 

The analytical model in this paper was obtained by decomposing the circuit into many
equivalent RC circuits, and using simple RC equations to estimate the delay of each stage. This
section shows how resistances and capacitances were estimated, as well as how they were combined
and the delay of a stage calculated.



5.1. Equivalent Resistances 

The equivalent resistance seen between drain and source of a transistor depends on how the 
transistor is being used. For each type of transistor (p and n), we will need two resistances:
full-on and switching.

5.1.1. Full-on Resistance 

The full-on resistance is the resistance seen between drain and source of a transistor assuming 
the gate voltage is constant and the gate is fully conducting. This resistance can be used for 
pass-transistors that (as far as the critical path is concerned) are always conducting. Also, this is 
the resistance that is used in the Horowitz approximation discussed in Section 5.5. 

It was assumed that the equivalent resistance of a conducting transistor is inversely
proportional to the transistor width (only minimum-length transistors were considered). The
equivalent resistance of any transistor can be estimated by:

$$res_{n,on}(W) = \frac{R_{n,on}}{W} \qquad res_{p,on}(W) = \frac{R_{p,on}}{W} \qquad (1)$$

where $R_{n,on}$ and $R_{p,on}$ are technology dependent constants. Appendix II shows values for
these two parameters in a 0.8μm CMOS process.



5.1.2. Switching Resistance 

This is the effective resistance of a pull-up or pull-down transistor in a switching static gate. 
For the most part, our model uses an inverter approximation due to Horowitz (see Section 5.5) to 
model such gates, but a simpler method using the static resistance is used to estimate the 
wordline driver size and the precharge delay. 

Again, we assume the equivalent resistance of a conducting transistor is inversely proportional
to the transistor width. Thus,

$$res_{n,switching}(W) = \frac{R_{n,switching}}{W} \qquad res_{p,switching}(W) = \frac{R_{p,switching}}{W} \qquad (2)$$

where $R_{n,switching}$ and $R_{p,switching}$ are technology dependent constants (see Appendix II).
The values shown in Appendix II were measured using Hspice simulations with equal input and
output transition times.
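Both resistance models reduce to a technology constant divided by the transistor width, which the following C sketch makes explicit. The `res_params` struct and function names are hypothetical, and any numbers supplied to it in an example are placeholders rather than the Appendix II values.

```c
#include <assert.h>

/* Technology constants in ohm*um; fill in from the process being modeled. */
typedef struct {
    double r_n_on, r_p_on;               /* full-on (Section 5.1.1) */
    double r_n_switching, r_p_switching; /* switching (Section 5.1.2) */
} res_params;

/* Resistance of a minimum-length transistor: constant divided by width. */
static double res_from_const(double r_const, double w)
{
    assert(r_const > 0.0 && w > 0.0);
    return r_const / w;
}

double res_n_on(const res_params *p, double w)
{
    return res_from_const(p->r_n_on, w);
}

double res_p_on(const res_params *p, double w)
{
    return res_from_const(p->r_p_on, w);
}

double res_n_switching(const res_params *p, double w)
{
    return res_from_const(p->r_n_switching, w);
}

double res_p_switching(const res_params *p, double w)
{
    return res_from_const(p->r_p_switching, w);
}
```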

5.2. Gate Capacitances 

The gate capacitance of a transistor consists of two parts: the capacitance of the gate, and the
capacitance of the polysilicon line going into the gate. If $L_{eff}$ is the effective length of the
transistor, $L_{poly}$ is the length of the poly line going into the gate, $C_{gate}$ is the
capacitance of the gate per unit area, and $C_{polywire}$ is the poly line capacitance per unit
area, then a transistor of width W has a gate capacitance of:

$$gatecap = W \times L_{eff} \times C_{gate} + L_{poly} \times L_{eff} \times C_{polywire}$$

The same formula holds for both NMOS and PMOS transistors.

The value of $C_{gate}$ depends on whether the transistor is being used as a pass transistor, or as
a pull-up or pull-down transistor in a static gate. Thus, two equations for the gate capacitance are
required:

$$gatecap(W, L_{poly}) = W \times L_{eff} \times C_{gate} + L_{poly} \times L_{eff} \times C_{polywire} \qquad (3)$$

$$gatecap_{pass}(W, L_{poly}) = W \times L_{eff} \times C_{gatepass} + L_{poly} \times L_{eff} \times C_{polywire}$$

Values for $C_{gate}$, $C_{gatepass}$, $C_{polywire}$, and $L_{eff}$ are shown in Appendix II. A
different $L_{poly}$ was used in the model for each transistor. This $L_{poly}$ was chosen based on
typical poly wire lengths for the structure in which it is used.
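The two gate capacitance expressions differ only in the per-area gate constant, as the following C sketch shows. The `gate_params` struct and its field names are hypothetical; the actual constants would come from Appendix II.

```c
#include <assert.h>

/* Process parameters for gate capacitance (units are the caller's choice,
 * as long as lengths and per-area capacitances are consistent). */
typedef struct {
    double l_eff;       /* effective transistor length */
    double c_gate;      /* gate capacitance per unit area, static gate */
    double c_gatepass;  /* gate capacitance per unit area, pass transistor */
    double c_polywire;  /* poly line capacitance per unit area */
} gate_params;

/* Gate capacitance of a pull-up or pull-down transistor in a static gate. */
double gatecap(const gate_params *p, double w, double l_poly)
{
    assert(w >= 0.0 && l_poly >= 0.0);
    return w * p->l_eff * p->c_gate + l_poly * p->l_eff * p->c_polywire;
}

/* Gate capacitance of a pass transistor. */
double gatecap_pass(const gate_params *p, double w, double l_poly)
{
    assert(w >= 0.0 && l_poly >= 0.0);
    return w * p->l_eff * p->c_gatepass + l_poly * p->l_eff * p->c_polywire;
}
```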

5.3. Drain Capacitances 

Figures 3 and 4 show typical transistor layouts for small and large transistors respectively. 
We have assumed that if the transistor width is larger than 10μm, the transistor is split as shown
in Figure 4. 



Figure 3: Transistor geometry if width < 10μm (labels: Leff, DRAIN, SOURCE, 3 x Leff, W)

Figure 4: Transistor geometry if width >= 10μm (label: W/2)

The drain capacitance is composed of both an area and perimeter component. Using the
geometries in Figures 3 and 4, the drain capacitance for a single transistor can be obtained. If the
width is less than 10μm,

$$draincap(W) = 3 L_{eff} \times W \times C_{diffarea} + (6 L_{eff} + W) \times C_{diffside} + W \times C_{diffgate}$$

where $C_{diffarea}$, $C_{diffside}$, and $C_{diffgate}$ are process dependent parameters (there are
two values for each of these: one for NMOS and one for PMOS transistors). $C_{diffgate}$ includes
the junction capacitance at the gate/diffusion edge as well as the oxide capacitance due to the
gate/source or gate/drain overlap. Values for n-channel and p-channel $C_{diffgate}$ are also given
in Appendix II.

If the width is larger than 10μm, we assume the transistor is folded (see Figure 4), reducing
the drain capacitance to:

$$draincap(W) = 3 L_{eff} \times \frac{W}{2} \times C_{diffarea} + 6 L_{eff} \times C_{diffside} + W \times C_{diffgate}$$



Now, consider two transistors (with widths less than 10μm) connected in series, with only a
single $L_{eff} \times W$ wide region acting as both the source of the first transistor and the
drain of the second. If the first transistor is on, and the second transistor is off, the
capacitance seen looking into the drain of the first is:

$$draincap(W) = 4 L_{eff} \times W \times C_{diffarea} + (8 L_{eff} + W) \times C_{diffside} + 3W \times C_{diffgate}$$

Figure 5 shows the situation if the transistors are wider than 10μm. In this case, the
capacitance seen looking into the drain of the inner transistor (x in the diagram) assuming it is on
but the outer transistor is off is:

$$draincap(W) = 5 L_{eff} \times \frac{W}{2} \times C_{diffarea} + 10 L_{eff} \times C_{diffside} + 3W \times C_{diffgate}$$



Figure 5: Two stacked transistors if each width >= 10μm

To summarize, the drain capacitance of x stacked transistors is:

if $W < 10\mu m$

$$draincap_n(W,x) = 3 L_{eff} \times W \times C_{n,diffarea} + (6 L_{eff} + W) \times C_{n,diffside} + W \times C_{n,diffgate} + (x-1) \times \{ L_{eff} \times W \times C_{n,diffarea} + 2 L_{eff} \times C_{n,diffside} + 2W \times C_{n,diffgate} \} \qquad (4)$$

$$draincap_p(W,x) = 3 L_{eff} \times W \times C_{p,diffarea} + (6 L_{eff} + W) \times C_{p,diffside} + W \times C_{p,diffgate} + (x-1) \times \{ L_{eff} \times W \times C_{p,diffarea} + 2 L_{eff} \times C_{p,diffside} + 2W \times C_{p,diffgate} \}$$

if $W \geq 10\mu m$

$$draincap_n(W,x) = 3 L_{eff} \times \frac{W}{2} \times C_{n,diffarea} + 6 L_{eff} \times C_{n,diffside} + W \times C_{n,diffgate} + (x-1) \times \{ L_{eff} \times W \times C_{n,diffarea} + 4 L_{eff} \times C_{n,diffside} + 2W \times C_{n,diffgate} \}$$

$$draincap_p(W,x) = 3 L_{eff} \times \frac{W}{2} \times C_{p,diffarea} + 6 L_{eff} \times C_{p,diffside} + W \times C_{p,diffgate} + (x-1) \times \{ L_{eff} \times W \times C_{p,diffarea} + 4 L_{eff} \times C_{p,diffside} + 2W \times C_{p,diffgate} \}$$
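The stacked-transistor drain capacitance can be sketched in C as a base term for the first transistor plus (x - 1) copies of a per-transistor increment, with the folded layout used at widths of 10μm and above. The `diff_params` struct and the `draincap` signature are hypothetical; one struct would be filled in for NMOS and another for PMOS, and widths are assumed to be in μm.

```c
#include <assert.h>

/* Diffusion capacitance parameters for one transistor type (n or p). */
typedef struct {
    double c_diffarea;  /* area component */
    double c_diffside;  /* sidewall (perimeter) component */
    double c_diffgate;  /* gate-edge component */
} diff_params;

/*
 * Drain capacitance of `stack` series transistors of width w (um).
 * Transistors of 10um or wider are assumed to be folded.
 */
double draincap(const diff_params *d, double l_eff, double w, int stack)
{
    double first, extra;

    assert(stack >= 1 && w > 0.0 && l_eff > 0.0);
    if (w < 10.0) {
        first = 3.0 * l_eff * w * d->c_diffarea
              + (6.0 * l_eff + w) * d->c_diffside
              + w * d->c_diffgate;
        extra = l_eff * w * d->c_diffarea
              + 2.0 * l_eff * d->c_diffside
              + 2.0 * w * d->c_diffgate;
    } else {
        first = 3.0 * l_eff * (w / 2.0) * d->c_diffarea
              + 6.0 * l_eff * d->c_diffside
              + w * d->c_diffgate;
        extra = l_eff * w * d->c_diffarea
              + 4.0 * l_eff * d->c_diffside
              + 2.0 * w * d->c_diffgate;
    }
    return first + (stack - 1) * extra;
}
```

With all three constants and $L_{eff}$ set to 1, a stack of two 4μm transistors gives 40, matching the two-transistor expression 4×4 + (8 + 4) + 3×4, and a stack of two 20μm transistors gives 120, matching 5×10 + 10 + 3×20.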



5.4. Other Parasitic Capacitances 

Other parasitic capacitances such as metal wiring are modeled using the values for $C_{bitmetal}$
and $C_{wordmetal}$ given in Appendix II. These capacitance values are fixed values per unit length
in terms of the RAM cell length and width. These values include an expected value for the area
and sidewall capacitances to the substrate and other layers. Besides being used for parasitic
capacitance estimation of the bitlines and wordlines themselves, they are also used to model the
capacitance of the predecode lines, data bus, address bus, and other signals in the memory.
Although the capacitance per unit length would probably be less for many of these busses than for
the bit lines and word lines, the same value is used for simplicity of modeling.



5.5. Horowitz Approximation 

In [2], Horowitz presents the following approximation for the delay of a static inverter with a
rising input:

$$delay_{rise}(t_f, t_{rise}, v_{th}) = t_f \sqrt{ (\log[v_{th}])^2 + 2\, t_{rise}\, b\, (1 - v_{th}) / t_f }$$

where $v_{th}$ is the switching voltage of the inverter (as a fraction of the maximum voltage),
$t_{rise}$ is the input rise time, $t_f$ is the output time constant (assuming a step input), and b
is the fraction of the swing in which the input affects the output (we used b = 0.5).

For a falling input with a fall time of $t_{fall}$, the above equation becomes:

$$delay_{fall}(t_f, t_{fall}, v_{th}) = t_f \sqrt{ (\log[1 - v_{th}])^2 + 2\, t_{fall}\, b\, v_{th} / t_f }$$

In this case, we used b = 0.4.

The delay of an inverter is defined as the time between the input reaching the switching
voltage (also called threshold voltage) of the inverter and the output reaching the switching
voltage of the following gate. If the inverter drives a gate with a different switching voltage,
the above equations need to be modified slightly. If the switching voltage of the switching gate is
$v_{th1}$ and the switching voltage of the following gate is $v_{th2}$, then:

$$delay_{rise}(t_f, t_{rise}, v_{th1}, v_{th2}) = t_f \sqrt{ (\log[v_{th1}])^2 + 2\, t_{rise}\, b\, (1 - v_{th1}) / t_f } + t_f (\log[v_{th1}] - \log[v_{th2}]) \qquad (5)$$

$$delay_{fall}(t_f, t_{fall}, v_{th1}, v_{th2}) = t_f \sqrt{ (\log[1 - v_{th1}])^2 + 2\, t_{fall}\, b\, v_{th1} / t_f } + t_f (\log[1 - v_{th1}] - \log[1 - v_{th2}])$$

6. Delay Model 

This section derives the cache read access and cycle time model. From Figure 1, the follow- 
ing components can be identified: 

• Decoder 

• Wordlines (in both the data and tag arrays) 

• Bitlines (in both the data and tag arrays) 

• Sense Amplifiers (in both the data and tag arrays) 

• Comparators 

• Multiplexor Drivers 

• Output Drivers (data output and valid signal output) 

The delay of each of these components will be estimated separately (Sections 6.1 to 6.10), and
will then be combined to estimate the access and cycle time of the entire cache (Section 6.11).

6.1. Decoder 

6.1.1. Decoder Architecture 

Figures 6 and 7 show the decoder architecture. It is assumed that each subarray has its own
decoder; therefore, there are $N_{dwl} \times N_{dbl}$ decoders associated with the data array, and
$N_{twl} \times N_{tbl}$ tag array decoders. One driver drives all the data array decoders, while
another drives the tag array decoders.

The decoder in Figure 7 contains three stages. Each block in the first stage takes three address
bits (in true and complement), and generates a 1-of-8 code. This can be done with 8 NAND
gates. Since there are

$$\log_2\left(\frac{C}{B \times A \times N_{dbl} \times N_{spd}}\right)$$

bits that must be decoded, the number of 3-to-8 blocks required is simply:

$$N_{3to8} = \left\lceil \frac{1}{3} \log_2\left(\frac{C}{B \times A \times N_{dbl} \times N_{spd}}\right) \right\rceil \qquad (6)$$

(if the number of address bits is not divisible by three, then 1-of-4 codes can be used to make up
the difference, but this was not modeled here).

Figure 7: Single decoder structure
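The count of 3-to-8 blocks is one third of the number of row address bits, rounded up, which can be sketched in C as follows. The helper names are hypothetical, and the sketch assumes the number of rows per subarray is a power of two.

```c
#include <assert.h>

/* Number of address bits needed to select one of `rows` rows.
 * Assumes rows is an exact power of two. */
int row_address_bits(long rows)
{
    int bits = 0;

    assert(rows > 0 && (rows & (rows - 1)) == 0);
    while (rows > 1) {
        rows >>= 1;
        bits++;
    }
    return bits;
}

/* Number of 3-to-8 predecode blocks: ceil(bits / 3). */
int num_3to8_blocks(long rows)
{
    return (row_address_bits(rows) + 2) / 3;
}
```

A subarray with 512 rows (9 address bits) needs three blocks, while 1024 rows (10 address bits) needs four, so each second-stage gate gains an input at that point.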

These 1-of-8 codes are combined using NOR gates in the second stage. One NOR gate is
required for each of the $\frac{C}{B \times A \times N_{dbl} \times N_{spd}}$ rows in the subarray.
Each NOR gate must take one input from each 3-to-8 block; thus, each NOR gate has $N_{3to8}$ inputs
(where $N_{3to8}$ was given in Equation 6).

The final stage is an inverter that drives each wordline driver.

6.1.2. Decoder Delay

Figure 8 shows a transistor-level diagram of the decoder. The decoder delay is the time after 
the input passes the threshold voltage of the decoder driver until norout reaches the threshold 
voltage of the final inverter (before the wordline driver). Notice that the delay does not include 
the time for the inverter to drive the wordline driver; this delay depends on the size of the 
wordline driver and will be considered in Section 6.2. 

Since, in many caches, decbus will be precharged before a cache access, the critical path will 
include the time to discharge decbus. This occurs after nandin rises, which in turn, is caused by 
address bits (or their inverses) falling. Once decbus has been discharged, norout will rise, and 
after another inverter and the wordline driver, the selected wordline will rise. 

Only one path is shown in the diagram; the extra inputs to the NAND gates are connected to 
other outputs of the decoder driver, and the extra inputs to the NOR gates are connected to the 
outputs of other NAND gates. The worst case for both the NAND and NOR stages occurs when 
all inputs to the gate change. This is the case that will be considered when estimating the 
decoder delay. 



6.1.3. Input Fall Time 

The delay of the first stage depends on the fall time of the input. To estimate a reasonable 
value for the input fall time, two inverters in series as shown in Figure 9 are considered. Each 
inverter is assumed to be the same size as the decoder driver (the first inverter in Figure 8). 

The Horowitz inverter approximation of Equation 5 is used to estimate the delay of each
inverter (and hence the output rise time). The time constant, $t_f$, of the first stage is
$R_{eq} \times C_{eq}$ where $R_{eq}$ is the equivalent resistance of the pull-up transistor in the
inverter (the full-on resistance, as described in Section 5.1) and $C_{eq}$ is the drain capacitance
of the two transistors in the first inverter stage plus the gate capacitance of the two transistors
in the second stage (Sections 5.3 and 5.2 show how these can be calculated). The input fall time of
the first stage is assumed to be 0 (a step input), and the initial and final threshold voltages are
the same. Thus, the delay of the first inverter can be written using the notation in Section 5 as:

$$T_1 = delay_{fall}(t_f, 0, v_{thdecdrive}, v_{thdecdrive})$$

where

$$t_f = res_{p,on}(W_{decdrivep}) \times ( draincap_n(W_{decdriven}, 1) + draincap_p(W_{decdrivep}, 1) + gatecap(W_{decdriven} + W_{decdrivep}) )$$

Figure 8: Decoder critical path

Figure 9: Circuit used to estimate reasonable input fall time

In the above equation, the widths of the transistors in the inverter are denoted by
$W_{decdriven}$ and $W_{decdrivep}$ and the threshold (switching) voltage of the inverter is denoted
by $v_{thdecdrive}$. Appendix I shows the assumed sizes and threshold voltages for each gate on the
critical path of the cache.

From the above equation, the rise time to the second stage can be approximated as
$T_1 / v_{thdecdrive}$.

The second stage can be worked out similarly:

$$T_2 = delay_{rise}\left(t_f, \frac{T_1}{v_{thdecdrive}}, v_{thdecdrive}, v_{thdecdrive}\right)$$

From this, the fall time of the second inverter output (and hence a reasonable fall time for the
cache input) can be written as:

$$infalltime = \frac{T_2}{1 - v_{thdecdrive}} \qquad (7)$$

Note that the above expressions for $T_1$ and $T_2$ will not be included in the cache access time;
they were only derived to estimate a reasonable input fall time (Equation 7).



6.1.4. Stage 1: Decoder Driver 

This section estimates the time for the first inverter in Figure 8 to drive the NAND inputs.
Each inverter drives $4 \times N_{dwl} \times N_{dbl}$ NAND gates (recall that both address and
address-bar are assumed to be available; thus, each driver only needs to drive half of the NAND
gates in its 3-to-8 block).




Figure 10: Decoder driver equivalent circuit

Figure 10 shows a simplified equivalent circuit. In this figure, $R_{eq}$ is the equivalent pull-up
resistance of the driver transistor plus the resistance of the metal lines used to tie the NAND
outputs to the NOR inputs. The wire length can be approximated by noting that the total edge
length of the memory is approximately $8 \times B \times A \times N_{dbl} \times N_{spd}$ cells. If
the memory arrays are grouped in two-by-two blocks, and if the predecode NAND gates are at the
center of each group, then the connection between the driver and the NAND gate is one quarter of the
sum of the array widths (see Figure 11). In large memories with many groups the bits in the memory
are arranged so that all bits driving the same data output bus are in the same group, reducing the
required length of the data bus.

Thus, if $R_{wordmetal}$ is the approximate resistance of a metal line per bit width, then
$R_{eq}$ is:

$$R_{eq} = res_{p,on}(W_{decdrivep}) + R_{wordmetal} \times \frac{8 \times B \times A \times N_{dbl} \times N_{spd}}{8}$$

Note that we have divided the $R_{wordmetal}$ term by an additional two; the first order
approximation for the delay at the end of a distributed RC line is RC/2 (we assume the resistance
and capacitance are distributed evenly over the length of the wire).

Figure 11: Memory block tiling assumptions

The equivalent capacitance $C_{eq}$ in Figure 10 can be written as:

$$C_{eq} = draincap_p(W_{decdrivep}, 1) + draincap_n(W_{decdriven}, 1) + 4 N_{dwl} N_{dbl} \times gatecap(W_{dec3to8n} + W_{dec3to8p}, 10) + 2 B A N_{dbl} N_{spd} \times C_{wordmetal}$$

where $C_{wordmetal}$ is the metal capacitance of a metal wire per bit width.

Using $R_{eq}$ and $C_{eq}$, the delay can be estimated as:

$$T_{dec,1} = delay_{fall}(C_{eq} \times R_{eq}, infalltime, v_{thdecdrive}, v_{thdec3to8}) \qquad (8)$$

where infalltime is from Equation 7.

6.1.5. Stage 2: NAND Gates 

This section estimates the time required for a NAND gate to discharge the decoder bus (and
the inputs to the NOR gates). The equivalent circuit is shown in Figure 12. In this diagram,
$R_{eq}$ is the equivalent resistance of the three pull-down transistors (in series). The total
resistance is approximated by $3 \times res_{n,on}(W_{dec3to8n})$. Since all three inputs are
changing simultaneously (in the worst case), each transistor has about the same resistance. In our
CMOS 0.8μm process, this approximation induces an error of about 10%-20%. The resistance $R_{eq}$
also includes the metal resistance of the lines connecting the NAND to the NOR gate. Since there are
$\frac{C}{B \times A \times N_{dbl} \times N_{spd}}$ rows in the subarray,

$$R_{eq} = 3 \times res_{n,on}(W_{dec3to8n}) + R_{bitmetal} \times \frac{C}{2 \times B \times A \times N_{dbl} \times N_{spd}}$$

where $R_{bitmetal}$ is the metal resistance per bit height.

Figure 12: Decoder driver equivalent circuit

The capacitance $C_{eq}$ is the sum of the input capacitances of
$\frac{C}{8 \times B \times A \times N_{dbl} \times N_{spd}}$ NOR gates, the drain capacitances of
the NAND driver, and the wire capacitance. Thus,

$$C_{eq} = 3 \times draincap_p(W_{dec3to8p}, 1) + draincap_n(W_{dec3to8n}, 3) + \frac{C}{8 B A N_{dbl} N_{spd}} \times gatecap(W_{decnorn} + W_{decnorp}, 10) + \frac{C}{B A N_{dbl} N_{spd}} \times C_{bitmetal}$$

The delay of this stage is given by:

$$T_{dec,2} = delay_{rise}\left(R_{eq} \times C_{eq}, \frac{T_{dec,1}}{v_{thdec3to8}}, v_{thdec3to8}, v_{thdecnor}\right) \qquad (9)$$

where $T_{dec,1}$ is from Equation 8.

6.1.6. Stage 3: NOR Gates 

The final part of the decoder delay is the time for a NOR gate to drive norout high. An
equivalent circuit similar to that of Figure 10 can be used. In this case, the pull-up resistance of
the NOR gate is approximated by $N_{3to8} \times res_{p,on}(W_{decnorp})$ where $N_{3to8}$ is the
number of inputs to each NOR gate (from Equation 6). The capacitance $C_{eq}$ is

$$C_{eq} = N_{3to8} \times draincap_n(W_{decnorn}, 1) + draincap_p(W_{decnorp}, N_{3to8}) + gatecap(W_{decinvn} + W_{decinvp})$$

Then,

$$T_{dec,3} = delay_{fall}\left(R_{eq} \times C_{eq}, \frac{T_{dec,2}}{1 - v_{thdecnor}}, v_{thdecnor}, v_{thdecinv}\right) \qquad (10)$$

where $T_{dec,2}$ is from Equation 9. Note that the value of $v_{thdecnor}$ depends on the number of
inputs to each NOR gate (Appendix I contains several values for $v_{thdecnor}$).

6.1.7. Total decoder delay 

By adding Equations 8 to 10, the total decoder delay can be obtained:

$$T_{decoder,data} = T_{dec,1} + T_{dec,2} + T_{dec,3} \qquad (11)$$

6.1.8. Analytical vs. Hspice Results

Figure 13 shows the decoder delay predicted by Equation 11 (solid lines) as well as the delay
predicted by Hspice (dotted lines). The transistor sizes used in the Hspice model are shown in
Appendix I and the technology parameters used are shown in Appendix II. The Hspice deck
used in this (and all other graphs in this paper) models an entire cache; this ensures that the input
slope and output loading effects of each stage are properly modeled.

The horizontal axis of Figure 13 is the number of rows in each subarray (which is
$\frac{C}{B \times A \times N_{dbl} \times N_{spd}}$). The results are shown for one and eight
subarrays. The analytical and Hspice results are in good agreement. The step in both results is due
to a change from 3-input to 4-input NOR gates in the final decode when moving from 9 address bits to
10 address bits.



Figure 13: Decoder delay (vertical axis: decoder delay, 0ns to 8ns; horizontal axis: rows in each array, 0 to 800)

6.1.9. Tag array decoder 

The equations derived above can also be used for the tag array decoder. The only difference is
that $N_{dwl}$, $N_{dbl}$, and $N_{spd}$ should be replaced by their tag array counterparts.

6.2. Wordlines



6.2.1. Wordline Architecture 

Figure 14 shows the wordline driver driving a wordline. The two inverters are the same as the
final two inverters in Figure 8 (recall that the decoder equations do not include the time to
discharge the decoder output).

Figure 14: Word line architecture (signals: norout, decout, word; the wordline driver drives 8×B×A×Nspd/Ndwl bits)

In Wada's access time model, it was assumed that wordline drivers are always the same size,
no matter how many columns are in the array. In this model, however, the wordline driver is
assumed to get larger as the wordline gets longer. Normally, a cache designer would choose a
target rise time, and adjust the driver size appropriately. Rather than assuming a constant rise
time for caches of all sizes, however, we assume the desired rise time (to a 50% word line swing)
is:

$$risetime = k_{rise} \times \ln(cols) \times 0.5$$

where

$$cols = \frac{8 \times B \times A \times N_{spd}}{N_{dwl}}$$

The constant $k_{rise}$ depends on the implementation technology. To obtain the transistor size
that would give this rise time, it is necessary to work backwards, using an equivalent RC circuit
to find the required driver resistance, and then finding the transistor width that would give this
resistance.



Figure 15: Equivalent circuit to find width of wordline driver

Figure 15 shows the equivalent circuit for the wordline. The pull-up resistance can be
obtained using the following RC equation

$$R_p = \frac{-risetime}{C_{eq} \times \ln(1 - v_{thwordline})} \qquad (12)$$

where $v_{thwordline}$ is the inverter threshold (relative to $V_{dd}$). This is significantly
higher than the voltage ($V_t$) at which the pass transistors in the memory cells begin to conduct.
The use of the inverter threshold gives a more intuitive delay for the wordline but it can result in
negative bitline delays.

The line capacitance is approximately the sum of gate capacitances of each memory cell in the 
row (a more detailed equation will be given later): 

C eq = cols x (2 x gateca Ppass (W a , 0) + C wordmetal ) 

This equation was derived by noting the wordline drives the gates of two pass transistors in each 
bit (the memory cell is shown in Figure 19). 

Once $R_p$ is found using Equation 12, the pull-up transistor's width can be found using:

$$W_{datawordp} = \frac{R_{p,switching}}{R_p}$$

where $R_{p,switching}$ is a constant that was discussed in Section 5.1.2. When calculating
capacitances, we will assume that the width of the NMOS transistor in the driver is half of
$W_{datawordp}$.

6.2.2. Wordline Delay 

There are two components to the word-line delay: the time to discharge the input of the 
wordline driver, and the time for the wordline driver to charge the wordline. 

Consider the first component. The capacitance that must be discharged is: 

$$C_{eq} = \text{draincap}_n(W_{decinvn}, 1) + \text{draincap}_p(W_{decinvp}, 1) + \text{gatecap}(W_{datawordp} + 0.5\,W_{datawordp}, 20)$$

The equivalent resistance of the pull-down transistor is simply

$$R_{eq} = \text{res}_{n,on}(W_{decinvn})$$

The delay is then

$$T_{word,1} = \text{delay}_{rise}\left(R_{eq} \times C_{eq},\ \frac{T_{dec,3}}{v_{thdecinv}},\ v_{thdecinv},\ v_{thworddrive}\right) \tag{13}$$

where $T_{dec,3}$ is the delay of the final decoder stage (from Equation 10). Note that in general,
$v_{thworddrive}$ will depend on the size of the wordline driver. If a constant ratio between the widths
of the NMOS and PMOS driver transistors is used, however, the threshold voltage is almost
constant.

From the previous section, the delay of the second stage is approximately 

$$T_{word,2,approx} = k_{rise} \times \ln(\text{cols}) \times v_{thwordline}$$

This equation, however, does not take into account changes in the input slope or wiring resis- 
tances and capacitances. To get a more accurate approximation, Horowitz's equation can once 
again be used, with: 
$$R_{eq} = \text{res}_{p,on}(W_{datawordp}) + \frac{\text{cols} \times R_{wordmetal}}{2}$$

$$C_{eq} = 2\,\text{cols} \times \text{gatecap}_{pass}(W_a, BitWidth - 2 W_a) + \text{draincap}_p(W_{datawordp}, 1) + \text{draincap}_n(0.5\,W_{datawordp}, 1) + \text{cols} \times C_{wordmetal}$$

The quantity $BitWidth$ in the above equation is the width (in µm) of one memory cell.
Using $C_{eq}$ and $R_{eq}$, the time to charge the wordline is:

$$T_{word,2} = \text{delay}_{fall}\left(R_{eq} \times C_{eq},\ \frac{T_{word,1}}{v_{thworddrive}},\ v_{thworddrive},\ v_{thwordline}\right) \tag{14}$$

Equations 13 and 14 can then be combined to give the total wordline delay:

$$T_{wordline,data} = T_{word,1} + T_{word,2} \tag{15}$$




6.2.3. Analytical and Hspice Comparisons 

As before, the analytical model was compared to results obtained from Hspice simulations. 
The technology parameters and transistor sizes shown in Appendices I and II were used, and the 
results in Figure 16 were obtained. The wordline in the Hspice deck was split into 8 sections, 
each section containing one eighth of the memory cells. The sections were separated by one 
eighth of the wordline resistance. As the graph shows, the equation matches the Hspice 
measurements very closely. 
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Figure 16: Wordline results 
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6.3. Tag Wordline 

Unlike the driver for the data array wordlines, it was assumed that the size of the wordline 
driver in the tag array is constant for all cache sizes since the tag array is (usually) much nar- 
rower than the data array. 

Figure 14 can be used to estimate the delay of the tag wordline. Again, there are two com- 
ponents to the delay: the time to discharge the wordline driver, and the time to charge the 
wordline itself. For the first component, the previous equations can be used: 

$$C_{eq} = \text{draincap}_n(W_{decinvn}, 1) + \text{draincap}_p(W_{decinvp}, 1) + \text{gatecap}(W_{tagwordp} + W_{tagwordn}, 20)$$

$$R_{eq} = \text{res}_{n,on}(W_{decinvn})$$

$$T_{tagword,1} = \text{delay}_{rise}\left(R_{eq} \times C_{eq},\ \frac{T_{dec,3}}{v_{thdecinv}},\ v_{thdecinv},\ v_{thtagworddrive}\right)$$

where $T_{dec,3}$ is the delay of the final decoder stage (from Equation 10). Note that in these
equations, $W_{tagwordp}$ and $W_{tagwordn}$ are constants (unlike the equations in the previous section).

The second component is slightly different. If an address contains $b_{addr}$ bits, then the number
of bits in a tag is:

$$\text{tagbits} = b_{addr} - \log_2(\text{cache size in bytes}) + \log_2(\text{associativity}) + 2 \tag{16}$$

The "+2" accounts for the valid and dirty bits. This quantity can then be used in:

$$R_{eq} = \text{res}_{p,on}(W_{tagwordp}) + \frac{\text{tagbits} \times R_{wordmetal}}{2}$$

$$C_{eq} = 2\,\text{tagbits} \times \text{gatecap}_{pass}(W_a, BitWidth - 2 \times W_a) + \text{draincap}_p(W_{tagwordp}, 1) + \text{draincap}_n(W_{tagwordn}, 1) + \text{tagbits} \times C_{wordmetal}$$

to give

$$T_{tagword,2} = \text{delay}_{fall}\left(R_{eq} \times C_{eq},\ \frac{T_{tagword,1}}{v_{thtagworddrive}},\ v_{thtagworddrive},\ v_{thwordline}\right) \tag{17}$$

The equations for $T_{tagword,1}$ and $T_{tagword,2}$ can then be summed to give the total delay
attributed to the tag array wordline:

$$T_{wordline,tag} = T_{tagword,1} + T_{tagword,2} \tag{18}$$
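As a quick check of Equation 16, the tag width can be computed directly. The sketch below assumes that the cache size and associativity are powers of two (so the logarithms are exact integers); the function names are illustrative only.

```python
def exact_log2(n):
    """Return log2(n) for a positive power of two; reject anything else."""
    if n <= 0 or n & (n - 1):
        raise ValueError("expected a positive power of two, got %r" % (n,))
    return n.bit_length() - 1

def tag_bits(b_addr, cache_size_bytes, associativity):
    """Equation 16: address bits minus index/offset bits, plus valid and dirty bits."""
    return b_addr - exact_log2(cache_size_bytes) + exact_log2(associativity) + 2

# A 16 KB cache with 32-bit addresses.
direct_mapped = tag_bits(32, 16 * 1024, 1)   # 32 - 14 + 0 + 2
four_way = tag_bits(32, 16 * 1024, 4)        # 32 - 14 + 2 + 2
```

Each doubling of associativity removes one index bit and therefore adds one tag bit, which in turn lengthens the tag wordline in the $R_{eq}$ and $C_{eq}$ expressions above.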

Figure 17 shows the wordline delay times from both the analytical model (solid line) and
Hspice (dotted line) using our 0.8µm CMOS process. In this case, the wordline in
the Hspice deck was divided into four sections, each section separated by one quarter of the total
wordline resistance. As can be seen, the two models agree very closely.
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Figure 17: Wordline of tag array 

6.4. Bit Lines 

6.4.1. Bitline architecture 

Each column in the array has two bitlines: bit and bitbar. After one of the wordlines goes 
high, each memory cell in the selected row begins to pull down one of its two bitlines; which 
bitline drops depends on the value stored in that memory cell. 

In most memories, the bitlines are initially precharged high and equilibrated using a circuit
like the one shown in Figure 18. During the precharge phase, both bitlines are charged to the
same voltage, $V_{bitpre}$. The four NMOS transistors in the figure are connected as diodes; their
only purpose is to drop the precharged voltage from $V_{dd}$ to $V_{bitpre}$ (in our process, $V_{dd}$ is 5 volts
and $V_{bitpre}$ is 3.3 volts). The sense amplifier that will be described in the next section requires
that the common mode voltage on the bitlines be less than $V_{dd}$.

A typical SRAM cell is shown in Figure 19. When the wordline goes high, the W a transistors 
will begin to conduct, discharging one of the bitlines. 

In many memories, a column select multiplexor is used to reduce the number of sense
amplifiers required. Figure 20 shows such a multiplexor. The gate of the pass transistor is
driven by signals from the output of the decoder. In this paper, the degree of multiplexing, that
is, the number of bitlines that share a common sense amplifier, is $N_{dbl} \times N_{spd}$. Thus, $8BA$ sense
amplifiers are required for the data array. Notice that the output of the column multiplexor is
precharged during the precharge phase (the precharging circuit is not shown, but is the same as
Figure 18).
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Figure 20: Column select multiplexor 
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6.4.2. Bitline delay 

The delay of the bitline is defined as the time between the wordline going high (reaching
$v_{thwordline}$) and one of the bitlines going low (reaching a voltage of $V_{bitsense}$ below its maximum
value).

6.4.3. Equivalent Circuit 

As previously mentioned, in each row, either bit or bitbar will go low. Consider the case
when bit goes low. The equivalent circuit in Figure 21 can be used. The transistors labeled $W_a$
and $W_d$ have been replaced by a resistor with resistance $R_{mem}$.

Figure 21: Bitline equivalent circuit

The capacitance $C_{line}$ includes
the capacitance of the memory cells sharing the bitline, the metal line capacitance, the drain
capacitance of the precharge circuit, and the drain capacitance of the column multiplexor pass
transistor:

$$C_{line} = \text{rows} \times \left( \tfrac{1}{2}\,\text{draincap}_n(W_a, 1) + C_{bitmetal} \right) + 2\,\text{draincap}_p(W_{bitpreequ}, 1) + \text{draincap}_n(W_{bitmuxn}, 1) \tag{19}$$

where

$$\text{rows} = \frac{C}{B A N_{dbl} N_{spd}} \tag{20}$$

The drain capacitance of each $W_a$ transistor is divided by two since each contact is shared
between two vertically adjacent cells.

The capacitance $C_{colmux}$ in Figure 21 is the capacitance seen by the output of the conducting
column multiplexor pass transistor. It includes the drain capacitance of all pass transistors
connected to this sense amplifier and the input capacitance of the sense amplifier:

$$C_{colmux} = (N_{spd} N_{dbl}) \times \text{draincap}_n(W_{bitmuxn}, 1) + 2\,\text{gatecap}(W_{senseQ1to4}, 10) \tag{21}$$

The resistance $R_{colmux}$ is simply:

$$R_{colmux} = \text{res}_{n,on}(W_{bitmuxn}) \tag{22}$$




If $N_{dbl} \times N_{spd} = 1$, then there are no column multiplexors, and Equations 19 to 22 can be replaced
by:

$$C_{line} = \text{rows} \times \left( \tfrac{1}{2}\,\text{draincap}_n(W_a, 1) + C_{bitmetal} \right) + 2\,\text{draincap}_p(W_{bitpreequ}, 1) \tag{23}$$

$$R_{colmux} = 0$$

The resistances $R_{mem}$ and $R_{line}$ do not depend on the value of $N_{dbl} \times N_{spd}$. $R_{mem}$ is the
equivalent resistance of the conducting transistors in the memory cell:

$$R_{mem} = \text{res}_{n,on}(W_a, 1) + \text{res}_{n,on}(W_d, 1) \tag{24}$$

Finally, $R_{line}$ is the metal resistance of the bitline. As before, we assume that the resistance is
lumped rather than distributed over the entire line. Thus,

$$R_{line} = \frac{\text{rows}}{2} \times R_{bitmetal} \tag{25}$$

where the number of rows is as in Equation 20.

6.4.4. Equivalent circuit solution 

Figure 21 can be viewed as an RC tree as described in [7]. Using the simple single time
constant approximation, the delay can be written as:

$$T_{step} = \left[ R_{mem} C_{line} + ( R_{mem} + R_{line} + R_{colmux} )\, C_{colmux} \right] \ln\left( \frac{V_{bitpre}}{V_{bitpre} - V_{bitsense}} \right) \tag{26}$$
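The single-time-constant expression for the step-input bitline delay is easy to evaluate numerically. The following Python sketch simply transcribes it; the function and argument names are chosen for this example, and the component values in the usage lines are arbitrary placeholders rather than process data from this report.

```python
import math

def bitline_step_delay(r_mem, r_line, r_colmux, c_line, c_colmux,
                       v_bitpre, v_bitsense):
    """Step-input bitline delay from the RC-tree time constant.

    tau = R_mem*C_line + (R_mem + R_line + R_colmux)*C_colmux
    T_step = tau * ln(V_bitpre / (V_bitpre - V_bitsense))
    """
    tau = r_mem * c_line + (r_mem + r_line + r_colmux) * c_colmux
    return tau * math.log(v_bitpre / (v_bitpre - v_bitsense))

# Placeholder values (ohms, farads, volts), for illustration only.
t_no_mux = bitline_step_delay(10e3, 50.0, 0.0, 200e-15, 0.0, 3.3, 0.1)
t_with_mux = bitline_step_delay(10e3, 50.0, 2e3, 200e-15, 40e-15, 3.3, 0.1)
```

Because only a small swing $V_{bitsense}$ is needed, the logarithmic factor is small, so the delay is a small fraction of the full RC time constant.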

6.4.5. Non-zero wordline rise times 

Equation 26 assumes that there is a step input on the wordline. This subsection describes how
the non-zero wordline rise time can be taken into account.

Figure 22 shows the wordline voltage (the input to the bitline circuit) as a function of time
assuming a step input on the wordline. The time difference $T_{step}$ shown on the graph is the time
after the wordline rises until the bitline reaches $V_{bitpre} - V_{bitsense}$ (the bitline voltage is not
shown on the graph). $T_{step}$ is given by Equation 26. During this time, we can consider the
bitline being "driven" towards $V_{bitpre} - V_{bitsense}$. Because the current $i$ sourced by the access
transistor can be approximated as

$$i \approx g_m (V_{gs} - V_t)$$

the shaded area in the graph can be thought of as the amount of charge discharged before the
output reaches $V_{bitpre} - V_{bitsense}$. This area can be calculated as:

$$\text{area} = T_{step} \times ( V_{dd} - V_t )$$

($V_t$ is the voltage at which the NMOS pass transistor begins to conduct).

Since the same amount of discharging is required to drive the bitline to $V_{bitpre} - V_{bitsense}$
regardless of the shape of the input waveform, we can also calculate the bitline delay for an
arbitrary wordline waveform. Consider Figure 23. If we assume the area is the same as in
Figure 22, then we can calculate the value of $T_{bitline,data}$. This value is the corrected bitline
delay ($V_s$ is the switching point of the NMOS pass transistor). Using simple algebra, it is easy to
show that
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Figure 22: Step input on wordline 
$$T_{bitline,data} = \frac{-2 (V_s - V_t) + \sqrt{4 (V_s - V_t)^2 - 4 m c}}{2 m} \tag{27}$$

where

$$c = \frac{1}{m} (V_s - V_t)^2 - 2\,T_{step} (V_{dd} - V_t)$$

and $m$ is the slope of the input waveform.

Figure 23: Slow-rising wordline




If the wordline rises quickly, as shown in Figure 24, then the algebra is slightly different. In
this case,

$$T_{bitline,data} = T_{step} + \frac{V_{dd} + V_t}{2 m} - \frac{V_s}{m} \tag{28}$$

Figure 24: Fast-rising wordline

The cross-over point between the two cases for $T_{bitline}$ occurs when:

$$T_{step} = \frac{V_{dd} - V_t}{2 m}$$

or

$$m = \frac{V_{dd} - V_t}{2\,T_{step}}$$

Calculating the slope, $m$, for a given wordline rise time is simple. From Section 6.2:

$$\text{rise time} = k_{rise} \times \ln(\text{cols})$$

Thus,

$$m = \frac{V_{dd}}{k_{rise} \times \ln(\text{cols})}$$

For most practical cases, the input rise time is slow enough that Equation 27 should be used.
However, it is always important to check the size of $m$ before calculating $T_{bitline,data}$.
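The choice between the slow-ramp and fast-ramp cases can be sketched as follows. This is a minimal illustration of the equal-area argument, assuming a linear wordline ramp of slope `m` that saturates at `v_dd`; the function name and the example voltages are chosen for this sketch and are not taken from the report's process data.

```python
import math

def bitline_delay(t_step, m, v_dd, v_t, v_s):
    """Bitline delay corrected for a wordline ramp of slope m (volts/second).

    t_step is the step-input delay; v_t is the pass-transistor turn-on voltage
    and v_s the switching point from which the bitline delay is measured.
    """
    if t_step <= (v_dd - v_t) / (2.0 * m):
        # Slow ramp: the required area is reached before the wordline hits v_dd.
        b = 2.0 * (v_s - v_t)
        c = (v_s - v_t) ** 2 / m - 2.0 * t_step * (v_dd - v_t)
        return (-b + math.sqrt(b * b - 4.0 * m * c)) / (2.0 * m)
    # Fast ramp: the wordline saturates at v_dd before the area is reached.
    return t_step + (v_dd + v_t) / (2.0 * m) - v_s / m

# Illustrative values: 5 V supply, 1 V turn-on, 2.5 V switching point, 1 V/ns ramp.
at_crossover = bitline_delay(2e-9, 1e9, 5.0, 1.0, 2.5)
```

At the cross-over point both branches give $(V_{dd} - V_s)/m$, so the corrected delay is continuous in $T_{step}$; as $m$ grows very large, the result approaches the step-input delay.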

6.4.6. Analytical vs. Hspice Results 

Again, an Hspice model was used to validate the analytical equations. Figure 25 shows the 
bitline delay for an array with 128 columns and no column-multiplexing. The lower two lines 
show the analytical and Hspice results assuming a step input on the wordline. The upper lines 
show the results if a non-zero wordline rise time is assumed. As the graph shows, for a wide 
range of array sizes (number of rows) the analytical predictions closely match the Hspice results. 
(The bitlines appear to have a negative delay for very small numbers of rows due to the relative 
thresholds used for the wordline and bitline delays.) 

Figure 26 shows the same thing if 8-way column multiplexing is used; that is, a single sense 
amplifier is shared among 8 pairs of bitlines. The error is somewhat larger than in Figure 25, but 
the analytical and Hspice results are still within 0.1ns of each other. 

Figure 27 shows the bitline delay as a function of the number of columns in the array. In the 
ideal case, in which the wordline rise time does not affect the bitline delay, the number of 
columns should have no effect. Indeed, the corresponding lines in Figure 27 are flat. The other 
two curves show how well the approximate method to take into account a non-zero rise time 
works. 
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Figure 25: Bitline results without column multiplexing 
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Figure 26: Bitline results with column multiplexing 
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Figure 27: Bitline results vs. number of columns 

Finally, Figure 28 shows how the delay is affected by the degree of column multiplexing. As 
expected, the larger the degree of multiplexing, the higher the delay, since more capacitance 
must be discharged when the bitline drops. Again, there is good agreement between the Hspice 
and analytical results. 

6.4.7. Tag array bitlines 

The equations derived in this section can be used for the tag array bitlines as well. The only
difference (besides replacing the data array organizational parameters with the corresponding tag
array parameters) is the calculation of the input rise time. The wordline rise time can be
approximated by:

$$\text{rise time} = \frac{T_{tagword,2}}{v_{thwordline}}$$

where $T_{tagword,2}$ was given in Equation 17. This leads to:

$$m = \frac{V_{dd}\, v_{thwordline}}{T_{tagword,2}}$$

The value for $T_{bitline,tag}$ can then be obtained from either Equation 27 or 28.
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Figure 28: Bitline results vs. degree of column multiplexing 



6.5. Sense Amplifier 

Wada's sense amplifier (reproduced in Figure 29) amplifies a voltage difference of $2 \times V_{bitsense}$
to $V_{dd}$. In [8], an approximation of the delay of the sense amplifier is written in terms of various
process parameters. In this model, we encapsulate several of these parameters into a single
process parameter, $t_{sense,data}$, which is the delay of the sense amp:

$$T_{sense,data} = t_{sense,data} \tag{29}$$

The value of $t_{sense,data}$ can be estimated from Hspice simulations. Figure 30 shows the delay
measured from Hspice and the constant delay predicted by the model as a function of input fall
time (neither the analytical model used here nor Wada's model takes into account the effects of a
non-zero bitline fall time). As the graph shows, the error is small.



The delay of the sense amplifier in the tag array can also be approximated by a constant:

$$T_{sense,tag} = t_{sense,tag} \tag{30}$$

Comparing Figures 31 and 30, $t_{sense,tag}$ is less than $t_{sense,data}$, even though the structures of the
sense amplifiers are identical. The difference is due to the output capacitance driven by each
sense amp; the sense amplifier in the tag array drives a comparator, while the data array sense
amplifier drives an output driver.
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Figure 29: Sense amplifier (from [8]) 
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Figure 30: Data array sense amplifier delay 
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Figure 31: Tag array sense amplifier delay 

The following stages will also require an approximation of the fall time of the sense amplifier
output. A constant fall time for each sense amp was assumed. The fall times will be denoted by
$tfall_{sense,data}$ and $tfall_{sense,tag}$; the values for our process are shown in Appendix I.



6.6. Comparator 

6.6.1. Comparator Architecture 

The comparator that was modeled is shown in Figure 32. The outputs from the sense
amplifiers are connected to the inputs labeled $b_n$ and $b_n$-bar. The $a_n$ and $a_n$-bar inputs are driven
by tag bits in the address. Initially, the output of the comparator is precharged high; a mismatch
in any bit will close one pull-down path and discharge the output. In order to ensure that the
output is not discharged before the $b_n$ bits become stable, node EVAL is held high until roughly
three inverter delays after the generation of the $b_n$-bar signals. This is accomplished by using a
timing chain driven by a sense amp on a dummy row in the tag array. The output of the timing
chain is used as a "virtual ground" for the pull-down paths of the comparator. When the large
NMOS transistor in the final inverter in the timing chain begins to conduct, the virtual ground
(and hence the comparator output if there is a mismatch) begins to discharge.

6.6.2. Comparator Delay 

Since we assume that the a n and b n bits will be stable by the time EVAL goes low, the critical 
path of the comparator is the propagation delay of the timing chain plus the time to discharge the 
output through the NMOS transistor in the final inverter. 

First consider the timing chain. The chain consists of progressively larger inverters; we
assume that the size of each stage is double the size of the previous stage (see Appendix I). Each
stage can be worked out using a simple application of Horowitz's approximation described in
Section 5.5:
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Figure 32: Comparator 

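$$t_{f,1} = \text{res}_{p,on}(W_{compinvp1}) \times [\,\text{gatecap}(W_{compinvn2} + W_{compinvp2}, 10) + \text{draincap}_p(W_{compinvp1}, 1) + \text{draincap}_n(W_{compinvn1}, 1)\,]$$

$$t_{f,2} = \text{res}_{n,on}(W_{compinvn2}) \times [\,\text{gatecap}(W_{compinvn3} + W_{compinvp3}, 10) + \text{draincap}_p(W_{compinvp2}, 1) + \text{draincap}_n(W_{compinvn2}, 1)\,]$$

$$t_{f,3} = \text{res}_{p,on}(W_{compinvp3}) \times [\,\text{gatecap}(W_{evalinvn} + W_{evalinvp}, 10) + \text{draincap}_p(W_{compinvp3}, 1) + \text{draincap}_n(W_{compinvn3}, 1)\,]$$

$$T_{comp,1} = \text{delay}_{fall}\left(t_{f,1},\ tfall_{sense,tag},\ v_{thcompinv1},\ v_{thcompinv2}\right)$$

$$T_{comp,2} = \text{delay}_{rise}\left(t_{f,2},\ \frac{T_{comp,1}}{v_{thcompinv2}},\ v_{thcompinv2},\ v_{thcompinv3}\right)$$

$$T_{comp,3} = \text{delay}_{fall}\left(t_{f,3},\ \frac{T_{comp,2}}{v_{thcompinv3}},\ v_{thcompinv3},\ v_{thevalinv}\right)$$
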




The final stage involves discharging the output through a pull-down path and the NMOS
transistor of the final inverter driver. An equivalent circuit is shown in Figure 33. The resistance
$R_{evaln}$ is the resistance of the pull-down transistor in the final inverter:

$$R_{evaln} = \text{res}_{n,switching}(W_{evalinvn})$$

In the worst case, only one pull-down path is conducting; the resistance $R_{pulldown}$ is the path's
equivalent resistance. Since it was assumed that the inputs are stable when the evaluation takes
place, we are interested in the full-on resistance of the pull-down path:

$$R_{pulldown} = 2\,\text{res}_{n,on}(W_{compn})$$

In Figure 32, it is clear that approximately half of the capacitance is on the input side of the
pull-down path, and half is on the output side. These capacitances will be denoted by $C_{eqbot}$ and
$C_{eqtop}$. The first can be written as:
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Figure 33: Comparator equivalent circuit 

$$C_{eqbot} = \text{tagbits} \times [\,\text{draincap}_n(W_{compn}, 1) + \text{draincap}_n(W_{compn}, 2)\,] + \text{draincap}_p(W_{evalinvp}, 1) + \text{draincap}_n(W_{evalinvn}, 1)$$

where tagbits is the number of tag bits (from Equation 16). Note that there are $2 \times \text{tagbits}$
pull-down paths (two for each bit); half of the "off" paths have the top transistor off, and half have the
bottom transistor off. We have also included the drain capacitances of the final inverter stage.

The capacitance $C_{eqtop}$ can be written similarly:

$$C_{eqtop} = \text{tagbits} \times (\,\text{draincap}_n(W_{compn}, 1) + \text{draincap}_n(W_{compn}, 2)\,) + \text{draincap}_p(W_{compp}, 1) + \text{gatecap}(W_{muxdrv1n} + W_{muxdrv1p}, 20) + \text{tagbits} \times N_{tbl} N_{tspd} C_{wordmetal}$$

The output capacitance is taken to be the input capacitance of either the multiplexor driver 
described in Section 6.7 or the valid signal driver in Section 6.9 (the first stage of both structures 
are the same, so they have the same input capacitance). We have also included metal 
capacitance of the metal output (it is assumed that the metal crosses the entire width of the tag 
array). 

The circuit in Figure 33 is equivalent to the circuit in Figure 21, so the same solution can be
used. The result is:

$$T_{step} = \left[ R_{evaln} C_{eqbot} + ( R_{evaln} + R_{pulldown} )\, C_{eqtop} \right] \ln\left( \frac{1}{v_{thmuxdrv1}} \right) \tag{31}$$

The non-zero input fall time can be taken into account using the same method as Section 6.4.
There are two possible equations; which one should be used depends on the slope of the input.
For a slow rising input:

$$T_{eval} = \frac{-2 (V_s - V_t) + \sqrt{4 (V_s - V_t)^2 - 4 m c}}{2 m} \tag{32}$$

where

$$c = \frac{1}{m} (V_s - V_t)^2 - 2\,T_{step} (V_{dd} - V_t)$$


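and $m$ is the slope of the input waveform. As in Section 6.4.7, the input transition time is
approximated by $T_{comp,3} / v_{thevalinv}$, so that:

$$m = \frac{V_{dd}\, v_{thevalinv}}{T_{comp,3}}$$
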

The above equation should be used when:

$$m < \frac{V_{dd} - V_t}{2\,T_{step}}$$

For a quickly rising input ($m$ greater than $(V_{dd} - V_t) / (2\,T_{step})$):

$$T_{eval} = T_{step} + \frac{V_{dd} + V_t}{2 m} - \frac{V_s}{m} \tag{33}$$

The above equations can be combined to give the total delay due to the comparator:

$$T_{compare} = T_{comp,1} + T_{comp,2} + T_{comp,3} + T_{eval} \tag{34}$$




6.6.3. Hspice Comparisons 

Figure 34 compares the analytical model to a Hspice model of the circuit. As can be seen, 
there is good agreement between the two models. 
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Figure 34: Comparator delay 



6.7. Multiplexor Driver 

In a set-associative cache, the result of the $A$ comparisons must be used to select which of the
$A$ possible blocks are to be sent out of the cache. The structure of the output multiplexors will be
described in the next section; here we concentrate on the driver that drives the select lines of
these multiplexors. Figure 35 gives the overall context of the multiplexor drivers and output
driver circuits in an $A$-way set-associative organization. Each multiplexor driver is responsible
for controlling the multiplexing of the $8B$ bits from each cache line onto a data bus that reads out
$b_o$ bits. This is repeated $A$ times in an $A$-way set-associative cache.
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Figure 35: Overview of data bus output driver multiplexors 
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6.7.1. Multiplexor Driver Architecture 

Figure 36 shows the structure of the three-stage multiplexor driver that we have assumed
(since there are $A$ comparators, $A$ copies of this block are required). The output of the
comparator is first inverted. This inverted match signal, matchb, is used to drive $8B / b_o$ NOR gates
(recall that $b_o$ is the number of output bits of the cache). The other inputs to the NOR gates are
derived from the address bits; we assume they are stable before the comparator result is valid.
The output of each NOR gate is again inverted (to produce selb) and the inverted signal is used
to drive the select lines of $b_o$ multiplexors.
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Figure 36: One of the A multiplexor driver circuits in an A-way set-associative cache 
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6.7.2. Multiplexor Driver Delay 

The delay of the three stages can be found in a manner similar to that used for the previously 
discussed circuits. Using Horowitz's approximation gives: 

$$t_{f,1} = \text{res}_{p,on}(W_{muxdrv1p}) \times \left[\, \frac{8B}{b_o}\,\text{gatecap}(W_{muxdrvnorn} + W_{muxdrvnorp}, 15) + \text{draincap}_p(W_{muxdrv1p}, 1) + \text{draincap}_n(W_{muxdrv1n}, 1) \,\right]$$

$$t_{f,2} = \text{res}_{n,on}(W_{muxdrvnorn}) \times [\,\text{gatecap}(W_{muxdrv3n} + W_{muxdrv3p}, 15) + \text{draincap}_p(W_{muxdrvnorp}, 2) + 2\,\text{draincap}_n(W_{muxdrvnorn}, 1)\,]$$

$$t_{f,3} = [\,\text{res}_{p,on}(W_{muxdrv3p}) + B A N_{spd} N_{dbl} R_{wordmetal}\,] \times [\,\text{gatecap}(W_{outdrvseln} + W_{outdrvselp} + W_{outdrvnorn} + W_{outdrvnorp}, 35) + \text{draincap}_p(W_{muxdrv3p}, 1) + \text{draincap}_n(W_{muxdrv3n}, 1) + 4 B A N_{spd} N_{dbl} C_{wordmetal}\,]$$

$$T_{muxdr,1} = \text{delay}_{fall}\left(t_{f,1},\ \frac{T_{eval}}{v_{thmuxdrv1}},\ v_{thmuxdrv1},\ v_{thmuxdrvnor}\right)$$

$$T_{muxdr,2} = \text{delay}_{rise}\left(t_{f,2},\ \frac{T_{muxdr,1}}{v_{thmuxdrvnor}},\ v_{thmuxdrvnor},\ v_{thmuxdrv3}\right)$$

$$T_{muxdr,3} = \text{delay}_{fall}\left(t_{f,3},\ \frac{T_{muxdr,2}}{v_{thmuxdrv3}},\ v_{thmuxdrv3},\ v_{thoutdrvsel}\right)$$

Note that in the second-stage NOR gate, we have assumed that only one pull-down path is
conducting. Also, we have included the metal resistance of the selb line (we assume the line travels
half the width of the cache).

The total multiplexor driver delay is simply: 

$$T_{muxdriver} = T_{muxdr,1} + T_{muxdr,2} + T_{muxdr,3}$$

6.7.3. Hspice comparisons 

Figures 37, 38 and 39 show the delay for the analytical and Hspice models as a function of
$b_{addr}$, the number of NOR gates ($8B / b_o$), and $b_o$ (although the delay is not strictly a function
of $b_{addr}$, the fall time of the comparator is, and this affects the multiplexor delay through
Horowitz's inverter approximation). As the graphs show, the analytical model matches the Hspice
results very well.



6.8. Output Driver 
6.8.1. Architecture 

The structure of the output driver is shown in Figure 40. Each sense amplifier in the data
array drives the senseout input of an output driver; since there are $8BA$ sense amplifiers, there
are that many output drivers. Typically, the number of cache output bits, $b_o$, is less than $8BA$.
Therefore, each of the output drivers is actually a tri-state driver. Each driver is turned on and
off using one of the selb signals generated in the multiplexor driver described in Section 6.7.

There are two potential critical paths through the output driver. In a set-associative cache, the
correct output can not be driven until both the senseout and selb signals are stable. Two
expressions will be derived. The first is the time after selb becomes stable until the inverted signal at
the NAND input is valid. The second expression is the time after sel and senseout are both valid
until the output of the driver is stable. Section 6.11 will show how these two quantities are used
when estimating the overall cache delay.

Figure 37: Multiplexor driver delay as a function of $b_{addr}$

Figure 38: Multiplexor driver delay as a function of $8B / b_o$
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Figure 39: Multiplexor driver delay as a function of b 0 
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Figure 40: Output driver 

6.8.2. Select invert stage 

The time constant of the inverter that inverts selb can be estimated as:

$$t_{f,inv} = \text{res}_{n,on}(W_{outdrvseln}) \times [\,\text{gatecap}(W_{outdrvnandn} + W_{outdrvnandp}, 10) + \text{draincap}_p(W_{outdrvselp}, 1) + \text{draincap}_n(W_{outdrvseln}, 1)\,]$$

which gives:

$$T_{outdrive,inv} = \text{delay}_{rise}\left(t_{f,inv},\ \frac{T_{muxdr,3}}{v_{thoutdrvsel}},\ v_{thoutdrvsel},\ v_{thoutdrvnand}\right)$$

Figure 41 compares the model predictions and Hspice measurements. 
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Figure 41: Output driver delay as a function of b 0 : selb inverter 

6.8.3. Data output path 

We assume the transistor sizes in the NAND and NOR gates are such that the delay through
each is the same. This model will include the delay through the NOR gate as follows:

$$t_{f,nor} = 2\,\text{res}_{p,on}(W_{outdrvnorp}) \times [\,\text{gatecap}(W_{outdrivern}, 10) + \text{draincap}_p(W_{outdrvnandp}, 2) + 2\,\text{draincap}_n(W_{outdrvnandn}, 1)\,]$$

$$T_{nor} = \text{delay}_{fall}\left(t_{f,nor},\ tfall_{sense,data},\ v_{thoutdrvnor},\ v_{thoutdriver}\right)$$

The output driver stage will be treated as an inverter:

$$t_{f,final} = \left[\, \text{res}_{p,on}(W_{outdriverp}) + R_{wordmetal} \frac{8 B A N_{spd}\, n_{vstack}}{2} \,\right] \times \left[\, \frac{8BA}{b_o} \left( \text{draincap}_p(W_{outdriverp}, 1) + \text{draincap}_n(W_{outdrivern}, 1) \right) + C_{wordmetal} \times ( 8 B A N_{spd}\, n_{vstack} ) + C_{out} \,\right]$$

$$T_{final} = \text{delay}_{rise}\left(t_{f,final},\ \frac{T_{nor}}{v_{thoutdriver}},\ v_{thoutdriver},\ 0.5\right)$$

where $C_{out}$ is the output capacitance of the cache and $n_{vstack}$ is the number of arrays stacked
vertically (the arrays are assumed to be laid out so as to make the resulting structure roughly
square).

The total delay of the second part of the driver can then be written as:

$$T_{outdrive,data} = T_{nor} + T_{final} \tag{36}$$

Figure 42 shows the analytical and Hspice estimations of $T_{outdrive,data}$.
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Figure 42: Output driver delay: data path 



6.9. Valid Output Driver 

The comparator also drives a valid output. In a set-associative cache, this driver is not on the 
critical path, but in a direct-mapped cache, it could be. Thus, it is necessary to estimate the delay 
of this driver. 

The valid signal driver is simply an inverter with transistor widths of $W_{muxdrv1n}$ and
$W_{muxdrv1p}$. The equations for the delay of the driver are:

$$t_f = \text{res}_{p,on}(W_{muxdrv1p}) \times [\,\text{draincap}_n(W_{muxdrv1n}, 1) + \text{draincap}_p(W_{muxdrv1p}, 1) + C_{out}\,]$$

$$T_{valid} = \text{delay}_{fall}\left(t_f,\ \frac{T_{eval}}{v_{thmuxdrv1}},\ v_{thmuxdrv1},\ 0.5\right)$$

Figure 43 shows the analytical and Hspice delays as a function of the number of bits in a tag. 
The number of tag bits affects the comparator fall time; this affects the valid output driver delay. 



6.10. Precharge Time 

This section derives an estimate for the extra time required after an access before the next 
access can begin. This difference between the access time and the cycle time can vary widely 
depending on the circuit techniques used. Usually the cycle time is a modest percentage larger 
than the access time, but in pipelined or post-charge circuits [6, 1] the cycle time can be less than 
the access time. We have chosen to model a more conventional structure with the cycle time 
equal to the access time plus the precharge. 
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Figure 43: Valid output driver delay 

Since the decoder, comparator, and bitlines need to be precharged, the extra time can be
written as:

$$\text{cycle time} - \text{access time} = \max(\,\text{data wordline fall time} + \text{data bitline charge time},\ \text{tag wordline fall time} + \text{tag bitline charge time},\ \text{comparator charge time},\ \text{decoder charge time}\,)$$

(note that the asserted wordline has to fall before the bitlines can be precharged).
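The maximum over the four precharged elements can be written out directly. The sketch below is only a transcription of the expression above; the function and argument names are invented for this example, and the sample times are arbitrary.

```python
def extra_cycle_time(data_wordline_fall, data_bitline_charge,
                     tag_wordline_fall, tag_bitline_charge,
                     comparator_charge, decoder_charge):
    """Cycle time minus access time: the slowest precharged element dominates.

    In each array the wordline must fall before its bitlines are recharged,
    so those two times add; the comparator and decoder recharge in parallel.
    """
    return max(data_wordline_fall + data_bitline_charge,
               tag_wordline_fall + tag_bitline_charge,
               comparator_charge,
               decoder_charge)

# Arbitrary sample times in nanoseconds.
extra = extra_cycle_time(0.8, 1.2, 0.5, 1.2, 1.0, 0.9)
```

With these sample numbers the data array term (0.8 + 1.2) is the largest, which mirrors the simplifying assumption made in the text that follows.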

The precharge times for the four precharged elements are somewhat arbitrary, since the 
precharging transistors can be scaled in proportion to the loads they are driving, while presenting 
a parasitic capacitance proportional to the other loads. If this is done then only the delay driving 
the precharge transistors changes with cache size. The precharge transistors are assumed to be 
driven by a taper buffer during the time the wordline is being discharged, so this time is not 
included in the precharge time. We assume that the time for the wordline to fall and bitline to 
rise in the data array is the dominant term in the above equation; therefore, the comparator, 
decoder, and tag bitline charge time will also be ignored. 

Assuming the wordline drivers have properly ratioed NMOS and PMOS transistors, the 
wordline fall time is the same as the wordline rise time derived earlier. After the wordline has 
dropped, it is necessary to wait until the two bitlines (bit and bitbar) are within V bitsens J2 of 
each other. It is assumed that the bitline precharging transistors are such that a constant (over all 
cache organization) bitline charge time is obtained. This constant will, of course, be technology 
dependent. In the model, we assume that this constant is equal to four inverter delays (each with 
a fanout of four): 

T_{prebitline} = 4 \times delay_{fall}(t_f, 0, 0.5, 0.5)

where

t_f = res_{p,on}(W_{decinvp}) \times [ draincap_n(W_{decinvn}, 1) +
      draincap_p(W_{decinvp}, 1) + 4 \, gatecap(W_{decinvp} + W_{decinvn}, 0) ]
The total precharge time can, therefore, be written as:

T_{precharge} = T_{wordline,data} + T_{prebitline}    (37)

6.11. Access and Cycle Times 

This section shows how the equations derived in the previous sections can be combined to 
give the hit access and cycle times of a cache. First consider the access time (the time after the 
address inputs are valid until the requested data is driven from the cache). For a direct-mapped
cache, the access time is simply the larger of the path through the tag array or the path through 
the data array: 

T_{access,dm} = \max( T_{dataside} + T_{outdrive,data},\ T_{tagside,dm} )    (38)

where:

T_{dataside} = T_{decoder,data} + T_{wordline,data} + T_{bitline,data} + T_{sense,data}

T_{tagside,dm} = T_{decoder,tag} + T_{wordline,tag} + T_{bitline,tag} + T_{sense,tag} + T_{compare} + T_{valid}

In a set-associative cache, the tag array must be read before the data signals can be driven. 
Thus, the access time of a set-associative cache can be written as:

T_{access,sa} = \max( T_{dataside},\ T_{tagside,sa} ) + T_{outdrive,data}    (39)

where:

T_{tagside,sa} = T_{decoder,tag} + T_{wordline,tag} + T_{bitline,tag} + T_{sense,tag} + T_{compare} +
                 T_{muxdriver} + T_{outdrive,inv}
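As a minimal sketch of how the two access time expressions combine the component delays, the Python below takes the individual delays as inputs and evaluates the direct-mapped and set-associative cases. The class name, field names, and sample values are hypothetical; the component delays themselves would come from the equations derived in the earlier sections.

```python
from dataclasses import dataclass


@dataclass
class Delays:
    """Component delays (any consistent time unit)."""
    decoder_data: float
    wordline_data: float
    bitline_data: float
    sense_data: float
    decoder_tag: float
    wordline_tag: float
    bitline_tag: float
    sense_tag: float
    compare: float
    valid: float          # used only in the direct-mapped tag path
    muxdriver: float      # used only in the set-associative tag path
    outdrive_inv: float   # used only in the set-associative tag path
    outdrive_data: float


def dataside(d: Delays) -> float:
    return d.decoder_data + d.wordline_data + d.bitline_data + d.sense_data


def tag_common(d: Delays) -> float:
    # Terms shared by both tag paths: decode, wordline, bitline, sense, compare.
    return (d.decoder_tag + d.wordline_tag + d.bitline_tag
            + d.sense_tag + d.compare)


def access_direct_mapped(d: Delays) -> float:
    # Data output drive overlaps with the tag path.
    return max(dataside(d) + d.outdrive_data, tag_common(d) + d.valid)


def access_set_associative(d: Delays) -> float:
    # The tag path must finish before the data output driver can start.
    tagside = tag_common(d) + d.muxdriver + d.outdrive_inv
    return max(dataside(d), tagside) + d.outdrive_data


# Hypothetical delays, chosen only to exercise the functions.
example = Delays(2.0, 1.0, 1.0, 0.5, 1.5, 0.5, 0.5, 0.25, 2.0,
                 0.5, 1.0, 0.25, 1.0)
print(access_direct_mapped(example), access_set_associative(example))
```

The structural difference is where the output driver term sits: inside the max for the direct-mapped cache, outside it for the set-associative cache.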

Figures 44 to 47 show analytical and Hspice estimations of the data and tag sides for direct- 
mapped and 4-way set-associative caches. To gather these results, the model was first used to 
find the optimal array organization parameters via exhaustive search for each cache size. These 
optimum parameters are shown in the figures (the six numbers associated with each point
correspond to N_{dwl}, N_{dbl}, N_{spd}, N_{twl}, N_{tbl}, and N_{tspd} in that order). The parameters were then used
in the Hspice model. As the graphs show, the difference between the analytical and Hspice 
results is less than 10% in every case. 
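The exhaustive search over the six array organization parameters can be sketched as follows. This is an illustration under stated assumptions: the candidate values (small powers of two), the function names, and the toy cost function are all hypothetical, and the toy cost function is not the report's model. In practice the callable would evaluate the access time equations for one cache size and return None for organizations that are not valid.

```python
from itertools import product


def best_organization(access_time, choices=(1, 2, 4, 8)):
    """Try every combination of the six organization parameters.

    access_time: callable taking (Ndwl, Ndbl, Nspd, Ntwl, Ntbl, Ntspd)
                 and returning a time, or None for an invalid organization.
    Returns (best_time, best_params), or None if nothing is valid.
    """
    best = None
    for params in product(choices, repeat=6):
        t = access_time(*params)
        if t is None:
            continue
        if best is None or t < best[0]:
            best = (t, params)
    return best


def toy_access_time(ndwl, ndbl, nspd, ntwl, ntbl, ntspd):
    # Placeholder cost with a single known minimum; NOT the report's model.
    return (5.0 + (ndwl - 2) ** 2 + (ndbl - 4) ** 2 + (nspd - 1) ** 2
            + (ntwl - 1) ** 2 + (ntbl - 2) ** 2 + (ntspd - 1) ** 2)


print(best_organization(toy_access_time))
```

With four candidate values per parameter the search visits 4^6 = 4096 organizations, which is cheap for an analytical model but would be impractical if each point required a circuit simulation.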

The cycle time of the cache (the minimum time between the start of consecutive accesses) is 
the access time plus the time to precharge the bitline, comparator, and internal decoder bus. 
Using Equation 37, the cycle time can be written as: 

T_{cycle} = T_{access} + T_{precharge}    (40)
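A short sketch of the cycle time calculation follows, using the simplification made earlier that the precharge time is the data wordline fall time plus a constant bitline charge time of four fanout-of-four inverter delays. The function and argument names and the sample values are hypothetical; the inverter delay is taken as an input rather than computed from the delay equations.

```python
def precharge_time(t_wordline_data, t_inverter_fo4):
    """Data wordline fall time plus four fanout-of-four inverter delays."""
    t_prebitline = 4 * t_inverter_fo4
    return t_wordline_data + t_prebitline


def cycle_time(t_access, t_wordline_data, t_inverter_fo4):
    """Minimum time between the start of consecutive accesses."""
    return t_access + precharge_time(t_wordline_data, t_inverter_fo4)


# Hypothetical values in ns, chosen only to exercise the functions.
print(cycle_time(t_access=7.0, t_wordline_data=1.0, t_inverter_fo4=0.25))
```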

7. Applications of Model 

This section gives examples of how the analytical model can be used to quickly gather data 
that can be used in architectural studies. 
Figure 45: Direct mapped: T_{tagside,dm}
Figure 46: 4-way set associative: T_{dataside} + T_{outdrive,data}
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7.1. Cache Size 

First consider Figures 48 and 49. These graphs show how the cache size affects the cache 
access and cycle times in a direct-mapped and 4-way set-associative cache. In these graphs (and
all graphs in the remainder of this report), b_o = 64 and b_{addr} = 32. For each cache size, the
optimum array organization parameters were found (these optimum parameters are shown in the
graphs as before; the six numbers associated with each point correspond to N_{dwl}, N_{dbl}, N_{spd},
N_{twl}, N_{tbl}, and N_{tspd} in that order), and the corresponding access and cycle times were plotted. In
addition, the graph breaks down the access time into several components. 
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Figure 48: Access/cycle time as a function of cache size for direct-mapped cache 
Figure 49: Access/cycle time as a function of cache size for set-associative cache (x-axis: cache size; B=16, A=4)

There are several observations that can be made from the graphs. Starting from the bottom, it 
is clear that the time through the data array decoders is always longer than the time through the 
tag array decoders. For all the organizations selected, there are more data subarrays (N_{dwl} × N_{dbl})
than tag subarrays (N_{twl} × N_{tbl}). The more data arrays, the slower the first decoder stage.

In all caches shown, the comparator is responsible for a significant portion of the access time. 
Another interesting trend is that the tag side is always the critical path in the cache access. In the 
direct-mapped cases, organizations are found which result in very closely matched tag and data
sides, while in the set-associative case, the paths are not matched nearly as well. This is due 
primarily to the delay of the multiplexor driver. 
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7.2. Block Size 

Figures 50 and 51 show how the access and cycle times are affected by the block size (the 
cache size is kept constant). In the direct-mapped graph, the access and cycle times drop as the 
block size increases. Most of this is due to a drop in the decoder delay (a larger block decreases 
the depth of each array and reduces the number of tags required). 
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Figure 50: Access/cycle time as a function of block size for direct-mapped cache 

In the set-associative case, the access and cycle times begin to increase as the block size gets
above 32. This is due to the output driver; a larger block size means more drivers share the same
cache output line, so there is more loading at the output of each driver. This trend can also be
seen in the direct-mapped case, but it is much less pronounced. The number of output drivers
that share a line is proportional to A, so the proportion of the total output capacitance that is the
drain capacitance of other output drivers is smaller in a direct-mapped cache than in the 4-way
set-associative cache. Also, in the direct-mapped case, the slower output driver only affects the
data side, and it is the tag side that dictates the access time in all the organizations shown.

Figure 51: Access/cycle time as a function of block size for set-associative cache
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7.3. Associativity 

Finally, consider Figures 52 and 53 which show how the associativity affects the access and 
cycle time of a 16KB and 64KB cache. As can be seen, there is a significant step between a 
direct-mapped and a 2-way set-associative cache, but a much smaller jump between a 2-way and 
a 4-way cache (this is especially true in the larger cache). As the associativity increases further, 
the access and cycle time begin to increase more dramatically. 




Figure 52: Access/cycle time as a function of associativity for 16K cache (x-axis: associativity; C=16K, B=16)

Figure 53: Access/cycle time as a function of associativity for 64K cache (x-axis: associativity; C=64K, B=16)

The real cost of associativity can be seen by looking at the tag path curve in either graph. For 
a direct-mapped cache, this is simply the time to output the valid signal, while in a set- 
associative cache, this is the time to drive the multiplexor select signals. Also, in a direct- 
mapped cache, the output driver time is hidden by the time to read and compare the tag. In a 
set-associative cache, the tag array access, compare, and multiplexor driver must be completed 
before the output driver can begin to send the result out of the cache. 

Looking at the 16KB cache results, there seems to be an anomaly in the data array decoder at 
A=2. This is due to a larger N_{dwl} at this point. This doesn't affect the overall access time,
however, since the tag access is the critical path. 
Another anomaly appears for large associativities, in which the bitline appears to have a negative
delay. For slow rising wordlines, the bitline can switch completely (recall that the two bitlines
only need to be V_{sense} volts apart) before the wordline has reached its gate threshold voltage.

8. Conclusions 

In this report, we have presented an analytical model for the access and cycle time of a cache. 
By comparing the model to an Hspice model, the model was shown to be accurate to within 
10%. The computational complexity, however, is considerably less than Hspice; measurements 
show the model to be over 100,000 times faster than Hspice. 

It is dangerous to draw too many conclusions from the graphs presented in this report.
Figures 52 and 53 seem to imply that a direct-mapped cache is always the best. While it is
always the fastest, it is important to remember that the direct-mapped cache will have the lowest
hit rate. Hit rate data obtained from a trace-driven simulation (or some other means) must be
included in the analysis before the various cache alternatives can be fairly compared. Similarly, 
a small cache has a lower access time, but will also have a lower hit rate. In [4], it was found 
that when the hit rate and cycle time are both taken into account, there is an optimum cache size 
between the two extremes. 
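As a rough illustration of combining the two factors, the sketch below picks the cache size that minimizes a simple average time per reference, taken here as the cycle time plus the miss rate times a miss penalty. This metric, the function names, and every number in the example are assumptions made for illustration only; they are not taken from [4] or from the results of this report.

```python
def average_time_per_reference(cycle_ns, miss_rate, miss_penalty_ns):
    """Assumed metric: every reference pays one cycle, misses pay a penalty."""
    return cycle_ns + miss_rate * miss_penalty_ns


def best_cache_size(candidates, miss_penalty_ns):
    """candidates: list of dicts with 'size_kb', 'cycle_ns', 'miss_rate'."""
    return min(
        candidates,
        key=lambda c: average_time_per_reference(
            c["cycle_ns"], c["miss_rate"], miss_penalty_ns),
    )


# Hypothetical data: larger caches cycle more slowly but miss less often.
candidates = [
    {"size_kb": 4, "cycle_ns": 5.0, "miss_rate": 0.10},
    {"size_kb": 16, "cycle_ns": 6.0, "miss_rate": 0.04},
    {"size_kb": 64, "cycle_ns": 9.0, "miss_rate": 0.02},
]
print(best_cache_size(candidates, miss_penalty_ns=50.0)["size_kb"])
```

In this made-up data set neither the smallest nor the largest cache wins, which is the kind of interior optimum described above.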
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Appendix I. Circuit Parameters 

The transistor sizes and threshold voltages used in the circuits in this report are given in Table
I-1 (all transistor lengths are 0.8µm).

Stage                     Symbol                             Value
------------------------  ---------------------------------  ----------
Decoder Driver            W_{decdrivep}                      100µm
                          W_{decdriven}                      50µm
                          V_{thdecdrive}                     0.438
Decoder NAND              W_{dec3to8p}                       60µm
                          W_{dec3to8n}                       90µm
                          V_{thdec3to8}                      0.561
Decoder NOR               W_{decnorp}                        12µm
                          W_{decnorn}                        2.4µm
                          V_{thdecnor} (one input)           0.503
                          V_{thdecnor} (two inputs)          0.452
                          V_{thdecnor} (three inputs)        0.417
                          V_{thdecnor} (four inputs)         0.390
Decoder inverter          W_{decinvp}                        10µm
                          W_{decinvn}                        5µm
                          V_{thdecinv}                       0.456
Wordline driver           W_{worddrivep}                     varies
                          W_{worddriven}                     varies
                          V_{thworddrive}                    0.456
                                                             0.4ns
Tag wordline driver       W_{tagwordp}                       10µm
                          W_{tagwordn}                       5µm
                          V_{thtagworddrive}                 0.456
Memory Cell (Fig 19)      W_a                                1µm
                          W_b                                3µm
                          W_d                                4µm
                          V_{thwordline}                     0.456
                          BitWidth                           8.0µm
                          BitHeight                          16.0µm
Bitlines                  W_{bitpreequ}                      80µm
                          W_{bitmuxn}                        10µm
                          V_{bitpre}                         3.3 volts
                          V_{bitsense}                       0.1 volts
                          V_t                                1.09 volts
Sense Amp (Fig 29)        Q1-Q4                              4µm
                          Q5-Q6                              8µm
                          Q7-Q10                             8µm
                          Q11-Q12                            16µm
                          Q13-Q14                            8µm
                          Q15                                16µm
                          t_{sense,data}                     0.58ns
                          t_{sense,tag}                      0.26ns
                          tfall_{sense,data}                 0.70ns
                          tfall_{sense,tag}                  0.70ns
Comparator inverter 1     W_{compinvp1}                      10µm
                          W_{compinvn1}                      6µm
                          V_{thcompinv1}                     0.437
Comparator inverter 2     W_{compinvp2}                      20µm
                          W_{compinvn2}                      12µm
                          V_{thcompinv2}                     0.437
Comparator inverter 3     W_{compinvp3}                      40µm
                          W_{compinvn3}                      24µm
                          V_{thcompinv3}                     0.437
Comparator eval           W_{evalinvp}                       20µm
                          W_{evalinvn}                       80µm
                          V_{thevalinv}                      0.267
Comparator                W_{compp}                          30µm
                          W_{compn}                          10µm
Mux Driver Stage 1        W_{muxdrv1p}                       50µm
                          W_{muxdrv1n}                       30µm
                          V_{thmuxdrv1}                      0.437
Mux Driver Stage 2        W_{muxdrvnorp}                     80µm
                          W_{muxdrvnorn}                     20µm
                          V_{thmuxdrvnor}                    0.486
Mux Driver Stage 3        W_{muxdrvselp}                     20µm
                          W_{muxdrvseln}                     12µm
                          V_{thmuxdrvsel}                    0.437
Output Driver (sel inv)   W_{outdrvselp}                     20µm
                          W_{outdrvseln}                     12µm
                          V_{thoutdrvsel}                    0.437
Output Driver NAND        W_{outdrvnandp}                    10µm
                          W_{outdrvnandn}                    24µm
                          V_{thoutdrvnand}                   0.441
Output Driver NOR         W_{outdrvnorp}                     40µm
                          W_{outdrvnorn}                     6µm
                          V_{thoutdrvnor}                    0.431
Output Driver (final)     W_{outdriverp}                     80µm
                          W_{outdrivern}                     48µm
                          V_{thoutdriver}                    0.425
                          C_{out}                            0.5 pF

Table I-1: Transistor sizes and threshold voltages
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Appendix II. Technology Parameters 

The technology parameters used in this report are given in Table II-1. The Spice models used
from [3] are given in Figure II-1.



Parameter            Value
-------------------  ----------------
C_{bitmetal}         4.4 fF/bit
C_{gate}             1.95 fF/µm^2
C_{gatepass}         1.45 fF/µm^2
C_{ndiffarea}        0.137 fF/µm^2
C_{ndiffside}        0.275 fF/µm
C_{ndiffgate}        0.401 fF/µm
C_{pdiffarea}        0.343 fF/µm^2
C_{pdiffside}        0.275 fF/µm
C_{pdiffgate}        0.476 fF/µm
C_{polywire}         0.25 fF/µm
C_{wordmetal}        1.8 fF/bit
L_{eff}              0.8 µm
R_{bitmetal}         0.320 Ω/bit
R_{n,switching}      25800 Ω*µm
R_{n,on}             9723 Ω*µm
R_{p,switching}      61200 Ω*µm
R_{p,on}             22400 Ω*µm
R_{wordmetal}        0.080 Ω/bit
V_{dd}               5 volts

Table II-1: 0.8µm CMOS process parameters
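The resistance entries in Table II-1 are given in Ω*µm, which suggests that the resistance of a particular transistor is obtained by dividing by its width. The sketch below applies that reading to the decoder inverter widths from Table I-1. The function and constant names are hypothetical, and the divide-by-width rule is an assumption inferred from the units rather than a restatement of the report's resistance equations.

```python
# Width-normalized "on" resistances from Table II-1, in ohm*um.
R_N_ON = 9723.0
R_P_ON = 22400.0


def res_on(width_um, r_ohm_um):
    """Assumed rule: resistance in ohms = width-normalized value / width."""
    if width_um <= 0:
        raise ValueError("transistor width must be positive")
    return r_ohm_um / width_um


# Decoder inverter from Table I-1: 10 um PMOS, 5 um NMOS.
print(res_on(10.0, R_P_ON))  # PMOS pull-up resistance in ohms
print(res_on(5.0, R_N_ON))   # NMOS pull-down resistance in ohms
```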



.model nt nmos ( level=3
+ vto=0.77    tox=1.65e-8   uo=570         gamma=0.80
+ vmax=2.7e5  theta=0.404   eta=0.04       kappa=1.2
+ phi=0.90    nsub=8.8e16   nfs=4e11       xj=0.2u
+ cj=2e-4     mj=0.389      cjsw=4.00e-10  mjsw=0.26
+ pb=0.80     cgso=2.1e-10  cgdo=2.1e-10   delta=0.0
+ ld=0.0001u  rsh=0.5 )

.model pt pmos ( level=3
+ vto=-0.87   tox=1.65e-8   uo=145         gamma=0.73
+ vmax=0.00   theta=0.233   eta=0.028      kappa=0.04
+ phi=0.90    nsub=9.0e16   nfs=4e11       xj=0.2u
+ cj=5e-4     mj=0.420      cjsw=4.00e-10  mjsw=0.31
+ pb=0.80     cgso=2.7e-10  cgdo=2.7e-10   delta=0.0
+ ld=0.0001u  rsh=0.5 )

Figure II-1: Generic 0.8µm CMOS Spice parameters [3]
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